
SCALING A CSV IMPORTER



Spreadsheet programs such as Google Sheets, Excel, and Numbers can all generate CSV-formatted files. It is also commonplace for third-party email, social, or marketing tools to let you export and download CSV reports. All of these are natural sources of CSV data for loading into a database, data warehouse, or data lake for further analysis.

  • In some cases, CSV files can be really big, with sizes in the gigabytes. Scaling a CSV importer to handle files of that size is a hard engineering problem, and at that scale it effectively becomes a big data problem. We tried to solve it with a completely different approach.

While building YoBulk's backend, we encountered multiple challenges with CSV files that run to gigabytes, because it is not possible to upload and store a file of that size in the browser. You can always choose a serverless approach or an EMR/Databricks-style solution that scales automatically.

In any distributed computing system (even beyond Apache Spark), there are well-known scaling trends (runtime vs. number of nodes), as illustrated below. These trends are universal and fundamental to computer science.

[Image: runtime vs. number of nodes]

As more nodes are added, the runtime of the job decreases but the cost increases. The bottom line is that you can scale anything, but you have to pay your cloud provider for it. At some point, adding more nodes yields diminishing returns: the job stops running faster, while cloud costs keep rising because more nodes are being paid for.
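To make the trade-off concrete, here is a minimal Python sketch of the runtime and cost curves. It assumes a simple Amdahl's-law-style model, which is not taken from the original charts: a fixed serial portion of the job plus a portion that divides evenly across nodes. The function names and the numbers (0.5 serial hours, 8 parallel hours, 1.0 per node-hour) are purely illustrative.

```python
from typing import Iterable, List, Tuple


def runtime_hours(nodes: int, serial_hours: float, parallel_hours: float) -> float:
    """Assumed model: the serial part is fixed, the parallel part splits across nodes."""
    return serial_hours + parallel_hours / nodes


def cost(nodes: int, hours: float, price_per_node_hour: float) -> float:
    """Every node is billed for the full duration of the job."""
    return nodes * hours * price_per_node_hour


def scaling_table(
    node_counts: Iterable[int],
    serial_hours: float = 0.5,      # illustrative value
    parallel_hours: float = 8.0,    # illustrative value
    price_per_node_hour: float = 1.0,  # illustrative value
) -> List[Tuple[int, float, float]]:
    """Return (nodes, runtime, cost) rows for each node count."""
    rows = []
    for n in node_counts:
        t = runtime_hours(n, serial_hours, parallel_hours)
        rows.append((n, t, cost(n, t, price_per_node_hour)))
    return rows


if __name__ == "__main__":
    for n, t, c in scaling_table([1, 2, 4, 8, 16, 32]):
        print(f"{n:>3} nodes  runtime={t:5.2f} h  cost={c:6.2f}")
```

Under this model the runtime can never drop below the serial portion, while the bill grows with every node added, which is the diminishing-returns pattern described above.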

We could easily have chosen an architecture like the one below, but it does not solve the real problem; it just passes the problem on to cloud infrastructure.

[Image: cloud-based architecture]

But YoBulk does all the scaling on-prem, through a highly optimized Node.js streaming architecture. YoBulk uses MongoDB for storage and handles all concurrency and bulk importing without any serverless cloud or big data infrastructure.
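The general idea behind a streaming importer can be sketched in a few lines. The example below is not YoBulk's implementation (which is Node-based); it is a minimal Python illustration of the technique, assuming rows are read one at a time and handed to a storage layer in fixed-size batches so that memory use stays bounded regardless of file size. The names `iter_batches`, `import_csv`, and `write_batch` are hypothetical.

```python
import csv
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, str]


def iter_batches(lines: Iterable[str], batch_size: int = 1000) -> Iterator[List[Row]]:
    """Lazily parse CSV lines and yield lists of at most batch_size rows."""
    reader = csv.DictReader(lines)  # first line is treated as the header
    batch: List[Row] = []
    for row in reader:
        batch.append(row)
        if len(batch) >= batch_size:
            yield batch
            batch = []  # start a fresh list; the old one belongs to the consumer
    if batch:
        yield batch  # final, possibly smaller, batch


def import_csv(
    lines: Iterable[str],
    write_batch: Callable[[List[Row]], object],
    batch_size: int = 1000,
) -> int:
    """Stream rows into write_batch (a hypothetical sink) and return the row count."""
    total = 0
    for batch in iter_batches(lines, batch_size):
        write_batch(batch)
        total += len(batch)
    return total
```

Because an open file is itself an iterable of lines, a multi-gigabyte file can be passed in directly, for example `with open(path, newline="") as f: import_csv(f, sink)`. With PyMongo, one plausible sink would be a collection's `insert_many` method, though that pairing is an assumption here rather than something the original describes.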

Our intention: businesses should not have to make large tech investments (EMR, Databricks, ETL tools) just to solve the CSV importing problem.